library(ggplot2)
library(dplyr)
library(ggfortify)
library(factoextra)
library(ICSNP)
library(MASS)
library(glmnet)
gap.raw <- read.csv('./gap.csv')
gap <- gap.raw
gap[,3:14]<- log(gap.raw[,3:14])
gdp <- (gap[,3:14])
years <- seq(1952, 2007,5)
colnames(gdp) <- years
rownames(gdp) <- gap[,2]
Note: log of the gap is called gdp and it is used for the analysis.
lifeExp <- gap[,15:26]
colnames(lifeExp) <- years
rownames(lifeExp) <- gap[,2]
plot(years, colMeans(exp(gdp)), type='l', xlab='Year', ylab='Average GDP', main='Average GDP over time')
gdp.cont <- aggregate(gdp, by=list(gap$continent), FUN=mean)
plot(years, gdp.cont[1,2:13], type='l', xlab='Year', ylab='Average GDP (log)', main='GDP (log) by continent over time', col=1, ylim=c(6.5, 10.5))
for(i in 2:6){
lines(years, gdp.cont[i,2:13], col=i)
}
legend('bottomright', legend=gdp.cont[,1], col=1:6, lty=1)
plot(years, colMeans(lifeExp), type='l', xlab='Year', ylab='Average Life Expectancy', main='Average Life Expectancy over time')
lifeExp.cont <- aggregate(lifeExp, by=list(gap$continent), FUN=mean)
plot(years, lifeExp.cont[1,2:13], type='l', xlab='Year', ylab='Average Life Expectancy', main='Life Expectancy by continent over time', col=1, ylim=c(30, 80))
for(i in 2:6){
lines(years, lifeExp.cont[i,2:13], col=i)
}
legend('bottomright', legend=lifeExp.cont[,1], col=1:6, lty=1)
gdp.pca.S <- prcomp(gdp, scale=FALSE)
gdp.pca.S.scores <- gdp.pca.S$x
summary(gdp.pca.S)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.0608 0.8718 0.38139 0.2215 0.16930 0.11366 0.09388
## Proportion of Variance 0.9416 0.0434 0.00831 0.0028 0.00164 0.00074 0.00050
## Cumulative Proportion 0.9416 0.9850 0.99328 0.9961 0.99771 0.99845 0.99895
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.07325 0.06644 0.05813 0.05649 0.04419
## Proportion of Variance 0.00031 0.00025 0.00019 0.00018 0.00011
## Cumulative Proportion 0.99926 0.99951 0.99971 0.99989 1.00000
cum.variance.S <- summary(gdp.pca.S)$importance[3,]
variance.S <- summary(gdp.pca.S)$importance[2,]
plot(cum.variance.S, type='l', xlab='Number of Principal Components', ylab='Cumulative Variance', main='Cumulative Variance of GDP (log)')
Plotting PCA scores
plot(gdp.pca.S, main='PCA Scores of GDP (log)')
According to the plot above, the first three principal components explain 99.3% of the variance in the data. Therefore, we will use the first three principal components for the analysis.
According to the negative and positive values of the principal components, it would be hard to interpret the results directly. Therefore we are going to plot them against each other to see if there is any pattern.
gdp.pca.S.scores <- data.frame(gdp.pca.S.scores)
gdp.pca.S.scores$country <- rownames(gdp.pca.S.scores)
gdp.pca.S.scores$continent <- gap$continent
plt <- ggplot(gdp.pca.S.scores, aes(PC1, PC2, colour=continent))
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2)
plt <- plt + ggtitle('GDP PCA', subtitle='PC1 vs PC2')
plt <- plt + theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC1 (', round(variance.S[1]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC2 (', round(variance.S[2]*100, 2), '%)', sep=''))
print(plt)
plt <- ggplot(gdp.pca.S.scores, aes(PC1, PC3, colour=continent))
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2)
plt <- plt + ggtitle('GDP PCA', subtitle='PC1 vs PC3')
plt <- plt + theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC1 (', round(variance.S[1]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC3 (', round(variance.S[3]*100, 2), '%)', sep=''))
print(plt)
plt <- ggplot(gdp.pca.S.scores, aes(PC2, PC3, colour=continent))
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2)
plt <- plt + ggtitle('GDP PCA', subtitle='PC2 vs PC3')
plt <- plt + theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC2 (', round(variance.S[2]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC3 (', round(variance.S[3]*100, 2), '%)', sep=''))
print(plt)
As we can see from the plots, the first two principal component could be used to separate the countries based on their continent. e.g. African countries are mostly on the right side of the plot, while European and Oceania countries are mostly on the left side of the plot.
The same result can be shown in the plot of PC1-PC3. However, the plot of PC2-PC3 shows that there is no clear separation between the countries based on their continent.
Between these countries, Kuwait behaves differently from the other countries. It is located far from other countries with low PC1 and high PC2 and PC3. This is because Kuwait has a very high GDP compared to other countries.
lifeExp.pca.S <- prcomp(lifeExp, scale=FALSE)
lifeExp.pca.S.scores <- lifeExp.pca.S$x
summary(lifeExp.pca.S)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 38.6350 9.83955 4.50873 2.52549 1.39085 1.27733 1.01896
## Proportion of Variance 0.9203 0.05969 0.01253 0.00393 0.00119 0.00101 0.00064
## Cumulative Proportion 0.9203 0.97994 0.99247 0.99641 0.99760 0.99860 0.99924
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.79691 0.48377 0.43319 0.33836 0.23486
## Proportion of Variance 0.00039 0.00014 0.00012 0.00007 0.00003
## Cumulative Proportion 0.99964 0.99978 0.99990 0.99997 1.00000
cum.variance.S <- summary(lifeExp.pca.S)$importance[3,]
variance.S <- summary(lifeExp.pca.S)$importance[2,]
plot(cum.variance.S, type='l', xlab='Number of Principal Components', ylab='Cumulative Variance', main='Cumulative Variance of Life Expectancy')
Plotting PCA scores
plot(lifeExp.pca.S, main='PCA Scores of Life Expectancy')
According to the plot above, the first three principal components explain 99.2% of the variance in the data. Therefore, we will use the first three principal components for the analysis.
lifeExp.pca.S.scores <- data.frame(lifeExp.pca.S.scores)
lifeExp.pca.S.scores$country <- rownames(lifeExp.pca.S.scores)
lifeExp.pca.S.scores$continent <- gap$continent
plt <- ggplot(lifeExp.pca.S.scores, aes(PC1, PC2, colour=continent))
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2)
plt <- plt + ggtitle('Life Expectancy PCA', subtitle='PC1 vs PC2')
plt <- plt + theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC1 (', round(variance.S[1]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC2 (', round(variance.S[2]*100, 2), '%)', sep=''))
print(plt)
plt <- ggplot(lifeExp.pca.S.scores, aes(PC1, PC3, colour=continent))
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2)
plt <- plt + ggtitle('Life Expectancy PCA', subtitle='PC1 vs PC3')
plt <- plt + theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC1 (', round(variance.S[1]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC3 (', round(variance.S[3]*100, 2), '%)', sep=''))
print(plt)
plt <- ggplot(lifeExp.pca.S.scores, aes(PC2, PC3, colour=continent))
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2)
plt <- plt + ggtitle('Life Expectancy PCA', subtitle='PC2 vs PC3')
plt <- plt + theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC2 (', round(variance.S[2]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC3 (', round(variance.S[3]*100, 2), '%)', sep=''))
print(plt)
As we can see from the plots, the first two principal component could be used to separate the countries based on their continent. e.g. African countries are mostly on the left side of the plot, while European and Oceania countries are mostly on the right side of the plot.
The same result can be shown in the plot of PC1-PC3. However, the plot of PC2-PC3 shows that there is no clear separation between the countries based on their continent.
Between these countries, Zimbabwe, and Zambia behave differently from the other countries. They are located far from other countries with low high PC2 and PC3. This is because these countries have a very low life expectancy compared to other countries.
Performing MDS on the combined dataset of GDP and Life Expectancy.
combined.ds <- gap[,3:26]
n <- nrow(combined.ds)
H <- diag(rep(1,n))-rep(1,n)%*%t(rep(1,n))/n
combined.ds <- H %*% as.matrix(combined.ds)
distances <- dist(combined.ds)
mds <- cmdscale(distances, k=2)
Now we will plot a 2-dimensional representation of the data.
mds <- data.frame(mds)
mds$country <- gap$country
mds$continent <- gap$continent
plt <- ggplot(mds, aes(X1, X2, colour=continent))
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2)
plt <- plt + ggtitle('Multidimensional scaling')
plt <- plt + theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab('V1')
plt <- plt + ylab('V2')
print(plt)
The plot above shows that the countries are separated based on their continent. e.g. African countries are mostly on the left side of the plot, while European and Oceania countries are mostly on the right side of the plot same as what we saw in the PCA plot of life expectancy.
Now we will conduct a multivariate hypothesis test for year 2007 and 1952 for Asian and European countries. The null hypothesis is that the mean of the GDP/life expectancy of the countries in each mentioned continent is the same. The alternative hypothesis is that the mean of the GDP/life expectancy of the countries in the continent is different.
gdp.2007 <- gdp %>% dplyr::select('2007')
lifeExp.2007 <- lifeExp %>% dplyr::select('2007')
gdp.2007$continent <- gap$continent
lifeExp.2007$continent <- gap$continent
gdp.2007.asia <- data.frame(gdp.2007[gdp.2007$continent == 'Asia',1])
lifeExp.2007.asia <- data.frame(lifeExp.2007[lifeExp.2007$continent == 'Asia',1])
gdp.2007.europe <- data.frame(gdp.2007[gdp.2007$continent == 'Europe',1])
lifeExp.2007.europe <- data.frame(lifeExp.2007[lifeExp.2007$continent == 'Europe',1])
gdp.1952 <- gdp %>% dplyr::select('1952')
lifeExp.1952 <- lifeExp %>% dplyr::select('1952')
gdp.1952$continent <- gap$continent
lifeExp.1952$continent <- gap$continent
gdp.1952.asia <- data.frame(gdp.1952[gdp.1952$continent == 'Asia',1])
lifeExp.1952.asia <- data.frame(lifeExp.1952[lifeExp.1952$continent == 'Asia',1])
gdp.1952.europe <- data.frame(gdp.1952[gdp.1952$continent == 'Europe',1])
lifeExp.1952.europe <- data.frame(lifeExp.1952[lifeExp.1952$continent == 'Europe',1])
HotellingsT2(gdp.2007.asia, gdp.2007.europe)
##
## Hotelling's two sample T2-test
##
## data: gdp.2007.asia and gdp.2007.europe
## T.2 = 25.215, df1 = 1, df2 = 61, p-value = 4.747e-06
## alternative hypothesis: true location difference is not equal to c(0)
qf(0.95, 1, 61)
## [1] 3.998494
The Hotelling’s T2 test statistic is 25.215 and the F-statistic is 3.99. so we reject the null hypothesis. This means that the mean of the GDP of the countries in Asia and Europe in 2007 is different.
Let’s compare these two continents in 1952.
HotellingsT2(gdp.1952.asia, gdp.1952.europe)
##
## Hotelling's two sample T2-test
##
## data: gdp.1952.asia and gdp.1952.europe
## T.2 = 24.088, df1 = 1, df2 = 61, p-value = 7.196e-06
## alternative hypothesis: true location difference is not equal to c(0)
qf(0.95, 1, 61)
## [1] 3.998494
The Hotelling’s T2 test statistic is 24 and the F-statistic is 3.99. so we reject the null hypothesis. This means that the mean of the GDP of the countries in Asia and Europe in 1952 is different. Although they are more similar in 1952 than 2007.
HotellingsT2(lifeExp.2007.asia, lifeExp.2007.europe)
##
## Hotelling's two sample T2-test
##
## data: lifeExp.2007.asia and lifeExp.2007.europe
## T.2 = 20.072, df1 = 1, df2 = 61, p-value = 3.34e-05
## alternative hypothesis: true location difference is not equal to c(0)
qf(0.95, 1, 61)
## [1] 3.998494
The Hotelling’s T2 test statistic is 20 and the F-statistic is 3.99. so we reject the null hypothesis. This means that the mean of the Life Expectancy of the countries in Asia and Europe in 2007 is different.
Let’s compare these two continents in 1952.
HotellingsT2(lifeExp.1952.asia, lifeExp.1952.europe)
##
## Hotelling's two sample T2-test
##
## data: lifeExp.1952.asia and lifeExp.1952.europe
## T.2 = 79.73, df1 = 1, df2 = 61, p-value = 1.129e-12
## alternative hypothesis: true location difference is not equal to c(0)
qf(0.95, 1, 61)
## [1] 3.998494
The Hotelling’s T2 test statistic is 79.73 and the F-statistic is 3.99. so we reject the null hypothesis. This means that the mean of the Life Expectancy of the countries in Asia and Europe in 1952 is different. Although they are more similar in 2007 than 1952 which is the opposite of what we saw in the GDP test.
We will now look at whether linear discriminant analysis can be used to successfully separate the continents. We will use lda to predict the continent of a country based on its GDP and life expectancy.
set.seed(1221)
# combine gdp and life expectancy
combined.ds <- gap[,3:26]
n <- nrow(combined.ds)
test_index <- sample(1:n, 0.2*n)
train <- combined.ds[-test_index,]
test <- combined.ds[test_index,]
lda.model.train <- lda(train, gap$continent[-test_index])
lda.model.test <- predict(lda.model.train, test)
print(paste('LDA accuracy:', sum(lda.model.test$class == gap$continent[test_index])/length(lda.model.test$class)*100, '%', sep=''))
## [1] "LDA accuracy:64.2857142857143%"
table(lda.model.test$class, gap$continent[test_index])
##
## Africa Americas Asia Europe
## Africa 8 0 1 0
## Americas 0 1 1 0
## Asia 0 2 5 0
## Europe 0 4 2 4
## Oceania 0 0 0 0
proj <- predict(lda.model.train, type="proba")$x[, 1:2]
proj.df <- data.frame(proj, continent = gap$continent[-test_index])
ggplot(proj.df, aes(x = LD1, y = LD2, color = continent)) +
geom_point() +
labs(x = "LD1", y = "LD2", color = "Continent")
proj <- predict(lda.model.train, test, type="proba")$x[, 1:2]
proj.df <- data.frame(proj, continent = gap$continent[test_index])
ggplot(proj.df, aes(x = LD1, y = LD2, color = continent)) +
geom_point() +
labs(x = "LD1", y = "LD2", color = "Continent")
As it can be seen from the plots above, the LDA model is able to separate the continents based on the GDP and life expectancy of the countries. The plots are somehow similar to the PCA plots we saw earlier. But it seems Life Expectancy is more important than GDP in separating the continents in this case.
We will now use K-Means clustering to cluster the countries based on their GDP and life expectancy.
set.seed(1221)
gdp.kmeans <- kmeans(gdp, centers = 3)
table(gdp.kmeans$cluster, gap$continent)
##
## Africa Americas Asia Europe Oceania
## 1 2 5 8 24 2
## 2 37 1 13 0 0
## 3 13 19 12 6 0
fviz_cluster(gdp.kmeans, data = gdp, ellipse.type = "norm",
palette = "jco", ggtheme = theme_minimal())
fviz_nbclust(gdp, kmeans, method = "wss")
set.seed(1221)
lifeExp.kmeans <- kmeans(lifeExp, centers = 3)
table(lifeExp.kmeans$cluster, gap$continent)
##
## Africa Americas Asia Europe Oceania
## 1 2 14 9 29 2
## 2 34 1 5 0 0
## 3 16 10 19 1 0
fviz_cluster(lifeExp.kmeans, data = lifeExp, ellipse.type = "norm",
palette = "jco", ggtheme = theme_minimal())
fviz_nbclust(lifeExp, kmeans, method = "wss")
As we can see, we use the elbow method to find the optimal number of clusters. For GDP, the optimal number of clusters is 3 and for life expectancy, it is also 3. The clusters are not very similar to the continents.
gap.scaled <- gap
gap.scaled[,3:26] <- scale(gap[,3:26])
gap.dist <- dist(gap.scaled[,3:26], method = "euclidean")
plot(hclust(gap.dist, method = "complete"), hang = -1, cex = 0.6, main = "Complete Linkage Clustering", labels = gap$country)
plot(hclust(gap.dist, method = "single"), hang = -1, cex = 0.6, main = "Single Linkage Clustering", labels = gap$country)
plot(hclust(gap.dist, method = "average"), hang = -1, cex = 0.6, main = "Average Linkage Clustering", labels = gap$country)
Plots above show the hierarchical clustering of the countries based on their GDP and life expectancy. The complete linkage clustering seems to be the best one. The clusters are not very similar to the continents.
We will now use linear regression to predict 2007 life expectancy from gdp values.
lifeExp.2007 <- lifeExp %>% dplyr::select('2007')
df <- cbind(gdp, lifeExp.2007)
colnames(df)[ncol(df)] <- 'target'
set.seed(1221)
n <- nrow(df)
test_index <- sample(1:n, 0.2*n)
train <- df[-test_index,]
test <- df[test_index,]
model.lm <- lm(target ~ ., data = train)
summary(model.lm)
##
## Call:
## lm(formula = target ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.9286 -2.1059 0.5705 3.8847 12.0786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5357 5.4798 0.645 0.52024
## `1952` -2.8641 7.2310 -0.396 0.69287
## `1957` 12.1431 11.0003 1.104 0.27227
## `1962` -8.1736 12.0940 -0.676 0.50069
## `1967` 3.0511 8.7295 0.350 0.72743
## `1972` -8.0452 7.5957 -1.059 0.29204
## `1977` 0.7646 6.7455 0.113 0.90997
## `1982` -1.5963 8.6317 -0.185 0.85365
## `1987` 8.2826 7.3898 1.121 0.26502
## `1992` -11.2368 6.7644 -1.661 0.09978 .
## `1997` 22.5288 8.3505 2.698 0.00818 **
## `2002` -15.3694 9.1163 -1.686 0.09490 .
## `2007` 8.0100 5.3722 1.491 0.13907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.139 on 101 degrees of freedom
## Multiple R-squared: 0.7003, Adjusted R-squared: 0.6647
## F-statistic: 19.67 on 12 and 101 DF, p-value: < 2.2e-16
pred <- predict(model.lm, test)
print(paste('RMSE:', sqrt(mean((pred - test$target)^2))))
## [1] "RMSE: 6.40057331119658"
The RMSE is 6.4 for the test set for linear regression model.
Let’s try Ridge regression to see if we can improve the RMSE.
train.matrix <- as.matrix(train[,-ncol(train)])
test.matrix <- as.matrix(test[,-ncol(test)])
train.target <- train$target
test.target <- test$target
model.ridge <- glmnet(train.matrix, train.target, alpha = 0)
plot(model.ridge, xvar = "lambda", label = FALSE)
lambdas <- 10^seq(3,-2,by=-0.1)
model.ridge.cv <- cv.glmnet(train.matrix, train.target, alpha = 0, lambda = lambdas)
plot(model.ridge.cv)
min.lambda <- model.ridge.cv$lambda.min
min.1se.lambda <- model.ridge.cv$lambda.1se
print(paste('Min lambda:', min.lambda))
## [1] "Min lambda: 1.99526231496888"
print(paste('Min 1se lambda:', min.1se.lambda))
## [1] "Min 1se lambda: 39.8107170553497"
coef <- coef(model.ridge.cv, s = model.ridge.cv$lambda.min)
print(paste('Number of coefficients:', length(coef[coef != 0])))
## <sparse>[ <logic> ]: .M.sub.i.logical() maybe inefficient
## [1] "Number of coefficients: 13"
pred <- predict(model.ridge.cv, test.matrix, s = model.ridge.cv$lambda.min)
print(paste('RMSE:', sqrt(mean((pred - test.target)^2))))
## [1] "RMSE: 6.34290896051329"
The RMSE is 6.34 for the test set for Ridge regression model. It is a bit better than the linear regression model. The number of coefficients is 13 which is the same as the number of features. This means that the Ridge regression model is not able to reduce the number of features. Let’s try Lasso regression to see if we can improve the RMSE.
model.lasso <- glmnet(train.matrix, train.target, alpha = 1)
plot(model.lasso, xvar = "lambda", label = FALSE)
lambdas <- 10^seq(3,-2,by=-0.1)
model.lasso.cv <- cv.glmnet(train.matrix, train.target, alpha = 1, lambda = lambdas)
plot(model.lasso.cv)
min.lambda <- model.lasso.cv$lambda.min
min.1se.lambda <- model.lasso.cv$lambda.1se
print(paste('Min lambda:', min.lambda))
## [1] "Min lambda: 0.398107170553497"
print(paste('Min 1se lambda:', min.1se.lambda))
## [1] "Min 1se lambda: 3.16227766016838"
coef <- coef(model.lasso.cv, s = model.lasso.cv$lambda.min)
print(paste('Number of coefficients:', length(coef[coef != 0])))
## <sparse>[ <logic> ]: .M.sub.i.logical() maybe inefficient
## [1] "Number of coefficients: 3"
pred <- predict(model.lasso.cv, test.matrix, s = model.lasso.cv$lambda.min)
print(paste('RMSE:', sqrt(mean((pred - test.target)^2))))
## [1] "RMSE: 6.27065071067278"
The RMSE is 6.27 for the test set for Lasso regression model. It is better than the other models. The number of coefficients is 3 which is much less than the number of features. This means that the Lasso regression model is able to reduce the number of features.
In this project, we have analyzed the Gapminder dataset. We have found that the life expectancy has increased over the years. We have also found that the GDP has increased over the years. We have found that the life expectancy is correlated with the GDP. We have also found that the life expectancy is correlated with the GDP per capita.
Good Luck!